{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Agent Chat with Multimodal Models: LLaVA\n",
    "\n",
    "This notebook uses **LLaVA** as an example for the multimodal feature. More information about LLaVA can be found in their [GitHub page](https://github.com/haotian-liu/LLaVA)\n",
    "\n",
    "\n",
    "This notebook contains the following information and examples:\n",
    "\n",
    "1. Setup LLaVA Model\n",
    "    - Option 1: Use [API calls from `Replicate`](#replicate)\n",
    "    - Option 2: Setup [LLaVA locally (requires GPU)](#local)\n",
    "2. Application 1: [Image Chat](#app-1)\n",
    "3. Application 2: [Figure Creator](#app-2)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "### Before everything starts, install AG2 with the `lmm` option\n",
    "```bash\n",
    "pip install \"autogen[lmm]>=0.3.0\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We use this variable to control where you want to host LLaVA, locally or remotely?\n",
    "# More details in the two setup options below.\n",
    "import os\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "from PIL import Image\n",
    "\n",
    "import autogen\n",
    "from autogen import Agent, AssistantAgent, LLMConfig\n",
    "from autogen.agentchat.contrib.llava_agent import LLaVAAgent, llava_call\n",
    "\n",
    "LLAVA_MODE = \"remote\"  # Either \"local\" or \"remote\"\n",
    "assert LLAVA_MODE in [\"local\", \"remote\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "<a id=\"replicate\"></a>\n",
    "## (Option 1, preferred) Use API Calls from Replicate [Remote]\n",
    "We can also use [Replicate](https://replicate.com/yorickvp/llava-13b/api) to use LLaVA directly, which will host the model for you.\n",
    "\n",
    "1. Run `pip install replicate` to install the package\n",
    "2. You need to get an API key from Replicate from your [account setting page](https://replicate.com/account/api-tokens)\n",
    "3. Next, copy your API token and authenticate by setting it as an environment variable:\n",
    "    `export REPLICATE_API_TOKEN=<paste-your-token-here>` \n",
    "4. You need to enter your credit card information for Replicate 🥲\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pip install replicate\n",
    "# import os\n",
    "# alternatively, you can put your API key here for the environment variable.\n",
    "# os.environ[\"REPLICATE_API_TOKEN\"] = \"r8_xyz your api key goes here~\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5",
   "metadata": {},
   "outputs": [],
   "source": [
    "if LLAVA_MODE == \"remote\":\n",
    "    llava_config = {\n",
    "        \"model\": \"whatever, will be ignored for remote\",  # The model name doesn't matter here right now.\n",
    "        \"api_key\": \"None\",  # Note that you have to setup the API key with os.environ[\"REPLICATE_API_TOKEN\"]\n",
    "        \"base_url\": \"yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591\",\n",
    "    }"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "<a id=\"local\"></a>\n",
    "## [Option 2] Setup LLaVA Locally\n",
    "\n",
    "\n",
    "## Install the LLaVA library\n",
    "\n",
    "Please follow the LLaVA GitHub [page](https://github.com/haotian-liu/LLaVA/) to install LLaVA.\n",
    "\n",
    "\n",
    "#### Download the package\n",
    "```bash\n",
    "git clone https://github.com/haotian-liu/LLaVA.git\n",
    "cd LLaVA\n",
    "```\n",
    "\n",
    "#### Install the inference package\n",
    "```bash\n",
    "conda create -n llava python=3.10 -y\n",
    "conda activate llava\n",
    "pip install --upgrade pip  # enable PEP 660 support\n",
    "pip install -e .\n",
    "```\n",
    "\n",
    "\n",
    "\n",
    "Some helpful packages and dependencies:\n",
    "```bash\n",
    "conda install -c nvidia cuda-toolkit\n",
    "```\n",
    "\n",
    "\n",
    "### Launch\n",
    "\n",
    "In one terminal, start the controller first:\n",
    "```bash\n",
    "python -m llava.serve.controller --host 0.0.0.0 --port 10000\n",
    "```\n",
    "\n",
    "\n",
    "Then, in another terminal, start the worker, which will load the model to the GPU:\n",
    "```bash\n",
    "python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b\n",
    "``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this code block only if you want to run LlaVA locally\n",
    "if LLAVA_MODE == \"local\":\n",
    "    llava_config = {\n",
    "        \"model\": \"llava-v1.5-13b\",\n",
    "        \"api_key\": \"None\",\n",
    "        \"base_url\": \"http://0.0.0.0:10000\",\n",
    "    }"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "# Multimodal Functions\n",
    "\n",
    "We cal test the `llava_call` function with the following AG2 image.\n",
    "![](../../../../../assets/img/autogen_agentchat.png)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "rst = llava_call(\n",
    "    \"Describe this AG2 framework <img /static/img/autogen_agentchat.png> with bullet points.\",\n",
    "    llm_config=LLMConfig(llava_config, temperature=0),\n",
    ")\n",
    "\n",
    "print(rst)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "<a id=\"app-1\"></a>\n",
    "## Application 1: Image Chat\n",
    "\n",
    "In this section, we present a straightforward dual-agent architecture to enable user to chat with a multimodal agent.\n",
    "\n",
    "\n",
    "First, we show this image and ask a question.\n",
    "![](https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "Within the user proxy agent, we can decide to activate the human input mode or not (for here, we use human_input_mode=\"NEVER\" for conciseness). This allows you to interact with LLaVA in a multi-round dialogue, enabling you to provide feedback as the conversation unfolds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "image_agent = LLaVAAgent(\n",
    "    name=\"image-explainer\",\n",
    "    max_consecutive_auto_reply=10,\n",
    "    llm_config=LLMConfig(llava_config, temperature=0.5, max_new_tokens=1000),\n",
    ")\n",
    "\n",
    "user_proxy = autogen.UserProxyAgent(\n",
    "    name=\"User_proxy\",\n",
    "    system_message=\"A human admin.\",\n",
    "    code_execution_config={\n",
    "        \"last_n_messages\": 3,\n",
    "        \"work_dir\": \"groupchat\",\n",
    "        \"use_docker\": False,\n",
    "    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.\n",
    "    human_input_mode=\"NEVER\",  # Try between ALWAYS or NEVER\n",
    "    max_consecutive_auto_reply=0,\n",
    ")\n",
    "\n",
    "# Ask the question with an image\n",
    "user_proxy.initiate_chat(\n",
    "    image_agent,\n",
    "    message=\"\"\"What's the breed of this dog?\n",
    "<img https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0>.\"\"\",\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "Now, input another image, and ask a followup question.\n",
    "\n",
    "![](https://th.bing.com/th/id/OIP.29Mi2kJmcHHyQVGe_0NG7QHaEo?pid=ImgDet&rs=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ask the question with an image\n",
    "user_proxy.send(\n",
    "    message=\"\"\"What is this breed?\n",
    "<img https://th.bing.com/th/id/OIP.29Mi2kJmcHHyQVGe_0NG7QHaEo?pid=ImgDet&rs=1>\n",
    "\n",
    "Among the breeds, which one barks less?\"\"\",\n",
    "    recipient=image_agent,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "<a id=\"app-2\"></a>\n",
    "## Application 2: Figure Creator\n",
    "\n",
    "Here, we define a `FigureCreator` agent, which contains three child agents: commander, coder, and critics.\n",
    "\n",
    "- Commander: interacts with users, runs code, and coordinates the flow between the coder and critics.\n",
    "- Coder: writes code for visualization.\n",
    "- Critics: LLaVA-based agent that provides comments and feedback on the generated image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "class FigureCreator(AssistantAgent):\n",
    "    def __init__(self, n_iters=2, **kwargs):\n",
    "        \"\"\"Initializes a FigureCreator instance.\n",
    "\n",
    "        This agent facilitates the creation of visualizations through a collaborative effort among its child agents: commander, coder, and critics.\n",
    "\n",
    "        Parameters:\n",
    "            - n_iters (int, optional): The number of \"improvement\" iterations to run. Defaults to 2.\n",
    "            - **kwargs: keyword arguments for the parent AssistantAgent.\n",
    "        \"\"\"\n",
    "        super().__init__(**kwargs)\n",
    "        self.register_reply([Agent, None], reply_func=FigureCreator._reply_user, position=0)\n",
    "        self._n_iters = n_iters\n",
    "\n",
    "    def _reply_user(self, messages=None, sender=None, config=None):\n",
    "        if all((messages is None, sender is None)):\n",
    "            error_msg = f\"Either {messages=} or {sender=} must be provided.\"\n",
    "            logger.error(error_msg)  # noqa: F821\n",
    "            raise AssertionError(error_msg)\n",
    "\n",
    "        if messages is None:\n",
    "            messages = self._oai_messages[sender]\n",
    "\n",
    "        user_question = messages[-1][\"content\"]\n",
    "\n",
    "        # Define the agents\n",
    "        commander = AssistantAgent(\n",
    "            name=\"Commander\",\n",
    "            human_input_mode=\"NEVER\",\n",
    "            max_consecutive_auto_reply=10,\n",
    "            system_message=\"Help me run the code, and tell other agents it is in the <img result.jpg> file location.\",\n",
    "            is_termination_msg=lambda x: x.get(\"content\", \"\").rstrip().endswith(\"TERMINATE\"),\n",
    "            code_execution_config={\n",
    "                \"last_n_messages\": 3,\n",
    "                \"work_dir\": \".\",\n",
    "                \"use_docker\": False,\n",
    "            },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.\n",
    "            llm_config=self.llm_config,\n",
    "        )\n",
    "\n",
    "        critics = LLaVAAgent(\n",
    "            name=\"Critics\",\n",
    "            system_message=\"\"\"Criticize the input figure. How to replot the figure so it will be better? Find bugs and issues for the figure.\n",
    "            Pay attention to the color, format, and presentation. Keep in mind of the reader-friendliness.\n",
    "            If you think the figures is good enough, then simply say NO_ISSUES\"\"\",\n",
    "            llm_config=LLMConfig(llava_config),\n",
    "            human_input_mode=\"NEVER\",\n",
    "            max_consecutive_auto_reply=1,\n",
    "            #     use_docker=False,\n",
    "        )\n",
    "\n",
    "        coder = AssistantAgent(\n",
    "            name=\"Coder\",\n",
    "            llm_config=self.llm_config,\n",
    "        )\n",
    "\n",
    "        coder.update_system_message(\n",
    "            coder.system_message\n",
    "            + \"ALWAYS save the figure in `result.jpg` file. Tell other agents it is in the <img result.jpg> file location.\"\n",
    "        )\n",
    "\n",
    "        # Data flow begins\n",
    "        commander.initiate_chat(coder, message=user_question)\n",
    "        img = Image.open(\"result.jpg\")\n",
    "        plt.imshow(img)\n",
    "        plt.axis(\"off\")  # Hide the axes\n",
    "        plt.show()\n",
    "\n",
    "        for i in range(self._n_iters):\n",
    "            commander.send(message=\"Improve <img result.jpg>\", recipient=critics, request_reply=True)\n",
    "\n",
    "            feedback = commander._oai_messages[critics][-1][\"content\"]\n",
    "            if feedback.find(\"NO_ISSUES\") >= 0:\n",
    "                break\n",
    "            commander.send(\n",
    "                message=\"Here is the feedback to your figure. Please improve! Save the result to `result.jpg`\\n\"\n",
    "                + feedback,\n",
    "                recipient=coder,\n",
    "                request_reply=True,\n",
    "            )\n",
    "            img = Image.open(\"result.jpg\")\n",
    "            plt.imshow(img)\n",
    "            plt.axis(\"off\")  # Hide the axes\n",
    "            plt.show()\n",
    "\n",
    "        return True, \"result.jpg\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt4_llm_config = autogen.LLMConfig.from_json(path=\"OAI_CONFIG_LIST\", cache_seed=42).where(\n",
    "    model=[\"gpt-4\", \"gpt-4-0314\", \"gpt4\", \"gpt-4-32k\", \"gpt-4-32k-0314\", \"gpt-4-32k-v0314\"]\n",
    ")\n",
    "\n",
    "# gpt35_llm_config = autogen.LLMConfig.from_json(\n",
    "#     path=\"OAI_CONFIG_LIST\", cache_seed=42\n",
    "# ).where(model=[\"gpt-35-turbo\", \"gpt-3.5-turbo\"])\n",
    "\n",
    "\n",
    "creator = FigureCreator(name=\"Figure Creator~\", llm_config=gpt4_llm_config)\n",
    "\n",
    "user_proxy = autogen.UserProxyAgent(\n",
    "    name=\"User\", human_input_mode=\"NEVER\", max_consecutive_auto_reply=0, code_execution_config={\"use_docker\": False}\n",
    ")  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.\n",
    "\n",
    "user_proxy.initiate_chat(\n",
    "    creator,\n",
    "    message=\"\"\"\n",
    "Plot a figure by using the data from:\n",
    "https://raw.githubusercontent.com/vega/vega/main/docs/data/seattle-weather.csv\n",
    "\n",
    "I want to show both temperature high and low.\n",
    "\"\"\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "if os.path.exists(\"result.jpg\"):\n",
    "    os.remove(\"result.jpg\")  # clean up"
   ]
  }
 ],
 "metadata": {
  "front_matter": {
   "description": "Leveraging multimodal models like llava.",
   "tags": [
    "multimodal",
    "llava"
   ]
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
